Abstract


We propose novel methods for the max-cost Discrete Function Evaluation Problem (DFEP) under budget constraints. We are motivated by applications such as clinical diagnosis, where a patient is subjected to a sequence of (possibly expensive) tests before a decision is made. Our goal is to develop strategies for minimizing max-costs. The problem is known to be NP-hard, and greedy methods based on specialized impurity functions have been proposed. We develop a broad class of admissible impurity functions that admit monomials, classes of polynomials, and hinge-loss functions, allowing for flexible impurity design with provably optimal approximation bounds. This flexibility is important for datasets in which max-cost is overly sensitive to “outliers.” Outliers bias max-cost toward a few examples that require a large number of tests for classification. We design admissible functions that allow for an accuracy-cost trade-off and result in $O(\log n)$ guarantees of the optimal cost among trees with corresponding classification accuracy levels.


1 Introduction 

In many applications such as clinical diagnosis, monitoring, and web search, a patient, entity or query is subjected to a sequence of tests before a decision or prediction is made. Tests can be expensive and are often complementary, namely, the outcome of one test may render another redundant. The goal in these scenarios is to minimize total test costs with negligible loss in diagnostic performance.


We propose to formulate this problem as an instance of the Discrete Function Evaluation Problem (DFEP). Under this framework, we seek to learn a decision tree which correctly classifies data while minimizing the cost of testing. We then propose methods to trade off accuracy and costs.

An instance of the problem is defined as $I = (S, C, T, c)$. Here $S = \{s_1, \ldots, s_n\}$ is the set of $n$ objects; $C = \{C_1, \ldots, C_m\}$ is a partition of $S$ into $m$ classes; $T$ is a set of tests; $c$ is a cost function that assigns a cost $c(t) > 0$ to each test $t \in T$. Applying test $t \in T$ on object $s \in S$ outputs a discrete value $t(s)$ in a finite set of possible outcomes $\{1, \ldots, l_t\}$. $T$ is assumed to be complete in the sense that for any distinct $s_i, s_j \in S$ there exists a $t \in T$ such that $t(s_i) \neq t(s_j)$, so they can be distinguished by $t$. Given an instance of the DFEP, the goal is to build a testing procedure that uses tests in $T$ to determine the class of an unknown object. Formally, any testing procedure can be represented by a decision tree, where every internal node is associated with a test and objects are directed from the root to the corresponding leaves based on the test outcomes at each node. Given instance $I$ and decision tree $D$, the testing cost of $s \in S$, denoted as $\mathrm{cost}(D, s)$, is the sum of all costs incurred along the root-to-leaf path in $D$ traced by $s$. We define the total cost as

\[
\mathrm{Cost}_W(D) = \max_{s \in S} \mathrm{cost}(D, s).
\]
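The two cost definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Node` class, the dictionary representation of objects (test name mapped to outcome), and the example costs are assumptions made for the illustration, not part of the original formulation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical decision-tree node; a leaf has test=None."""
    test: Optional[str] = None
    children: Dict[int, "Node"] = field(default_factory=dict)  # outcome -> child


def cost(tree, obj, test_cost):
    """cost(D, s): sum of test costs on the root-to-leaf path traced by obj."""
    total, node = 0.0, tree
    while node.test is not None:
        total += test_cost[node.test]
        node = node.children[obj[node.test]]
    return total


def max_cost(tree, objects, test_cost):
    """Cost_W(D): the largest testing cost over all objects."""
    return max(cost(tree, s, test_cost) for s in objects)


# Illustrative instance: test t1 costs 1, test t2 costs 3.
test_cost = {"t1": 1.0, "t2": 3.0}
tree = Node("t1", {0: Node(), 1: Node("t2", {0: Node(), 1: Node()})})
objects = [{"t1": 0, "t2": 0}, {"t1": 1, "t2": 0}, {"t1": 1, "t2": 1}]
print(max_cost(tree, objects, test_cost))
```

Objects that leave the tree after the first test pay only $c(t_1)$, while the max-cost is determined by the objects that also need $t_2$.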


This is known as the max-cost testing problem in the DFEP literature and has independently received significant attention [Cicalese et al., 2014; Saettler et al., 2014; Moshkov, 2010; Bellala et al., 2012] due to the fact that in real-world problems, the prior probability used to compute the expected testing cost is either unavailable or inaccurate. Another motivation stems from time-critical applications, such as emergency response [Bellala et al., 2012], where violation of a time constraint may lead to unacceptable consequences.


















In this paper we propose novel approaches and themes for the max-cost DFEP problem. It is now well known [Cicalese et al., 2014] that $O(\log n)$ is the best approximation factor for DFEP unless P = NP. Greedy methods that achieve an $O(\log n)$ approximation guarantee have been proposed [Cicalese et al., 2014; Saettler et al., 2014; Moshkov, 2010]. These methods often rely on judiciously engineering so-called impurity functions that are surprisingly effective in realizing “optimal” $O(\log n)$ guarantees. The authors of [Cicalese et al., 2014; Moshkov, 2010; Saettler et al., 2014] describe impurity functions based on the notion of Pairs, while the authors of [Bellala et al., 2012] describe more complex impurity functions but require distributional assumptions.

In contrast, we propose a broad class of admissible functions such that any function from this class can be chosen as an impurity function with an $O(\log n)$ approximation guarantee. Our admissible functions are in essence positive, monotone supermodular functions and admit not only pairs, monomials, and classes of polynomials, but also hinge-loss functions.

We propose new directions for the max-cost DFEP problem. In contrast to the current emphasis on correct classification, we propose to deliberately trade off cost with accuracy. This perspective can be justified under various scenarios. First, max-cost is overly sensitive to “outliers,” namely, a few instances that require prohibitively many tests for correct classification. In these situations max-cost is not representative of most of the data and is biased towards a small subset of objects. Consequently, censoring those few “outliers” is meaningful from the perspective that max-cost then applies to all but a few examples. Second, many applications have hard cost constraints that supersede correct classification of the entire data set, and the goal is a tree that guarantees these cost constraints while minimizing errors.

Our proposed admissible functions are sufficiently general and allow for trading accuracy for cost. In particular we develop methods with $O(\log n)$ guarantees of the optimal cost among trees with a corresponding classification accuracy level. Moreover, we show empirically on a number of examples that the selection of impurity functions plays an important role in this trade-off. In particular some admissible functions, such as hinge-loss, are particularly well suited for low budgets while others are preferable in high-budget scenarios.

Apart from the related approaches already described above, our work is also related to those that generally deal with expected costs [Golovin and Krause, 2011; Golovin et al., 2010; Bellala et al., 2012] or related problems such as the submodular set coverage problem [Guillory and Bilmes, 2010]. At a conceptual level, the main difference in [Guillory and Bilmes, 2010; Golovin and Krause, 2011; Golovin et al., 2010; Bellala et al., 2012] is in the way tests are chosen. Unlike our approach, these methods employ utility functions in the policy space that act on a sequence of observations. [Golovin and Krause, 2011] develops the notion of adaptive submodularity and applies it to automated diagnosis. The proposed adaptive greedy algorithm can handle multiple classes/test outcomes and arbitrary test costs, but the approximation factor for the max-cost depends on the prior probability and can be very large in adversarial situations. A popular class of related approximation algorithms is generalized binary search (GBS) [Dasgupta, 2004; Kosaraju et al., 1999; Nowak, 2008]. A special case of this problem, in which each object belongs to a distinct class, is known as the object identification problem [Chakaravarthy et al., 2011] or pool-based active learning [Dasgupta, 2004]. When tests are restricted to binary outcomes and uniform test costs, an $O(\log(1/p_{\min}))$ approximation, where $p_{\min}$ is the minimum probability of any single object, can be obtained [Dasgupta, 2004]. Alternatively, [Gupta et al., 2010] provides an algorithm which leads to an $O(\log n)$ approximation factor for the optimal expected cost with arbitrary test costs and binary test outcomes. With respect to the max-cost, [Hanneke, 2006] gave an $O(\log n)$ approximation for multiway tests and arbitrary test costs.


Organization: We present a greedy algorithm in Section 2, which we show under general assumptions on the impurity function leads to an $O(\log n)$ approximation of the optimal tree. We examine the assumptions on impurity functions and use them to define a class of admissible impurity functions in Section 3. Following this, we generalize from the error-free case to the trade-off between max-cost and error in Section 4. Finally, we demonstrate the performance of the greedy algorithm on real-world data sets in Section 5 and show the advantage of different impurity functions along with the trade-off between error and max-cost.






























































2 Greedy Algorithm and Analysis 

In this section, we present an analysis of the greedy algorithm GreedyTree. We first show that GreedyTree yields a tree whose max-cost is within $O(\log n)$ of the optimal max-cost for any DFEP. This bound on max-cost holds for any impurity function that satisfies a very general set of criteria, as opposed to a fixed impurity function. In Section 3, we examine the assumptions on the impurity functions and present multiple examples of impurity functions for which this approximation bound holds.

Before beginning the analysis, we first define the following terms: for a given impurity function $F$, $F(G)$ is the impurity function on the set of objects $G$; $D_F$ is the family of decision trees with $F(L) = 0$ for any of its leaves $L$; $\mathrm{OPT}(S)$ is the minimum max-cost among all trees in $D_F$ for the given input set of objects $S$; $\mathrm{Cost}_F(S)$ is the max-cost of the tree constructed by GreedyTree based on impurity function $F$.

Algorithm 1 GreedyTree
procedure GreedyTree($G$, $T$)
    if $F(G) = 0$ then return
    for each test $t \in T$ do
        Compute $R(t) := \max_{i \in \text{outcomes}} \frac{c(t)}{F(G) - F(G_t^i)}$,
        where $G_t^i$ is the set of objects in $G$ that has outcome $i$ for test $t$.
    $\hat{t} \leftarrow \arg\min_t R(t)$
    $T \leftarrow T \setminus \{\hat{t}\}$
    for each outcome $i$ of $\hat{t}$ do
        GreedyTree($G_{\hat{t}}^i$, $T$)
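The recursion in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: objects are represented as `(outcomes, label)` pairs, the tree is a nested dictionary, the Pairs count is used as the default impurity, and the helper names `split`, `greedy_tree` and `tree_max_cost` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def pairs(labels):
    """Pairs impurity: number of object pairs with different labels."""
    counts = Counter(labels).values()
    n = sum(counts)
    return (n * n - sum(c * c for c in counts)) // 2


def split(objs, t):
    """Group (outcomes, label) objects by their outcome on test t."""
    groups = defaultdict(list)
    for outcomes, label in objs:
        groups[outcomes[t]].append((outcomes, label))
    return groups


def greedy_tree(objs, tests, cost, impurity=pairs):
    """Pick the test minimizing c(t) / (F(G) - max_i F(G_t^i)), then recurse."""
    labels = [y for _, y in objs]
    f_g = impurity(labels)
    if f_g == 0:
        return {"leaf": labels[0]}

    def ratio(t):
        worst = max(impurity([y for _, y in g]) for g in split(objs, t).values())
        # A test that does not reduce the worst-case impurity gets R(t) = infinity.
        return math.inf if worst >= f_g else cost[t] / (f_g - worst)

    best = min(tests, key=ratio, default=None)
    if best is None or ratio(best) == math.inf:
        raise ValueError("the test set is not complete for these objects")
    rest = [t for t in tests if t != best]
    children = {o: greedy_tree(g, rest, cost, impurity)
                for o, g in split(objs, best).items()}
    return {"test": best, "children": children}


def tree_max_cost(tree, cost):
    """Max-cost of a tree built by greedy_tree."""
    if "leaf" in tree:
        return 0.0
    return cost[tree["test"]] + max(tree_max_cost(c, cost)
                                    for c in tree["children"].values())


# Four objects 0..3 with two binary tests; the label equals the high bit.
objs = [({"hi": x >> 1, "lo": x & 1}, x >> 1) for x in range(4)]
cost = {"hi": 1.0, "lo": 1.0}
tree = greedy_tree(objs, ["lo", "hi"], cost)
print(tree["test"], tree_max_cost(tree, cost))
```

In this small instance the test `hi` yields pure children, so its ratio is $1/4$ against $1/3$ for `lo`, and the greedy tree consists of that single test.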


For simplicity, we assume that the impurity function takes on integer values and that test costs are outcome-independent. Note that restricting to integer-valued impurity functions is not a limitation because of the discrete (finite) nature of the problem: one can always scale any rational-valued impurity function to make it integer-valued. Similarly, it can be easily shown that our result extends to the outcome-dependent cost setting considered in [Saettler et al., 2014] as well.

Given a DFEP, GreedyTree greedily chooses the test with the largest worst-case impurity reduction relative to its cost until all leaves are pure, i.e. impurity equals zero. Let $\tau$ be the first test selected by GreedyTree. By definition of the max-cost,

\[
\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} = \frac{c(\tau) + \max_i \mathrm{Cost}_F(S_\tau^i)}{\mathrm{OPT}(S)},
\]

where $S_\tau^i$ is the set of objects in $S$ that has outcome $i$ for test $\tau$. Let $q$ be such that $\mathrm{Cost}_F(S_\tau^q) = \max_i \mathrm{Cost}_F(S_\tau^i)$. We first provide a lemma to lower bound the optimal cost, which will later be used to prove a bound on the cost of the tree.

Lemma 2.1. Let $F$ be monotone and supermodular, and let $\tau$ be the first test chosen by GreedyTree on the set of objects $S$. Then

\[
\frac{c(\tau) F(S)}{F(S) - F(S_\tau^q)} \le \mathrm{OPT}(S).
\]

Proof. Let $D^* \in D_F$ be a tree with optimal max-cost. Let $v$ be an arbitrarily chosen internal node in $D^*$, let $\gamma$ be the test associated with $v$ and let $R \subseteq S$ be the set of objects associated with the leaves of the subtree rooted at $v$. Let $i$ be such that $c(\tau)/(F(S) - F(S_\tau^i))$ is maximized and $j$ be such that $c(\gamma)/(F(S) - F(S_\gamma^j))$ is maximized. We then have:

\[
\frac{c(\tau)}{F(S) - F(S_\tau^q)} \le \frac{c(\tau)}{F(S) - F(S_\tau^i)} \le \frac{c(\gamma)}{F(S) - F(S_\gamma^j)} \le \frac{c(\gamma)}{F(R) - F(R_\gamma^j)}. \tag{1}
\]

The first inequality follows from the definition of $i$. The second inequality follows from the greedy choice at the root. To show the last inequality, we have to show $F(S) - F(S_\gamma^j) \ge F(R) - F(R_\gamma^j)$. This follows from the fact that $S_\gamma^j \cup R \subseteq S$ and $R_\gamma^j = S_\gamma^j \cap R$, and therefore $F(S) \ge F(S_\gamma^j \cup R) \ge F(S_\gamma^j) + F(R) - F(R_\gamma^j)$, where the first inequality follows from monotonicity and the second follows from the definition of supermodularity.

For a node $v$, let $S(v)$ be the set of objects associated with the leaves of the subtree rooted at $v$. Let $v_1, v_2, \ldots, v_p$ be a root-to-leaf path on $D^*$ as follows: $v_1$ is the root of the tree, and for each $i = 1, \ldots, p-1$ the node $v_{i+1}$ is a child of $v_i$ associated with the branch $j$ that maximizes $c(t_i)/(F(S) - F(S_{t_i}^j))$, where $t_i$ is the test associated with $v_i$. It follows from (1) that

\[
\frac{[F(S(v_i)) - F(S(v_{i+1}))]\, c(\tau)}{F(S) - F(S_\tau^q)} \le c(t_i). \tag{2}
\]

Since the cost of the path from $v_1$ to $v_p$ is no larger than the max-cost of $D^*$, we have that

\[
\mathrm{OPT}(S) \ge \sum_{i=1}^{p-1} c(t_i) \ge \frac{c(\tau)}{F(S) - F(S_\tau^q)} \sum_{i=1}^{p-1} \big(F(S(v_i)) - F(S(v_{i+1}))\big) = \frac{c(\tau)\,(F(S) - F(S(v_p)))}{F(S) - F(S_\tau^q)} = \frac{c(\tau) F(S)}{F(S) - F(S_\tau^q)},
\]

where the last equality holds because $v_p$ is a leaf of $D^*$ and hence $F(S(v_p)) = 0$.

□
Using Lemma 2.1 we can now state the main theorem of this section, which bounds the cost of the greedily constructed tree.


Theorem 2.2. GreedyTree constructs a decision tree achieving an $O(\log n)$-factor approximation of the optimal max-cost in $D_F$ on the set $S$ of $n$ objects if $F$ is non-negative, monotone, supermodular with $\log(F(S)) = O(\log n)$.


Proof.

\[
\begin{aligned}
\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} &= \frac{c(\tau) + \mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S)} && (3) \\
&\le \frac{c(\tau)}{\mathrm{OPT}(S)} + \frac{\mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S_\tau^q)} && (4) \\
&\le \frac{F(S) - F(S_\tau^q)}{F(S)} + \frac{\mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S_\tau^q)} && (5) \\
&\le \log\!\left(\frac{F(S)}{F(S_\tau^q)}\right) + \log(F(S_\tau^q)) + 1 && (6) \\
&= \log(F(S)) + 1 = O(\log n). && (7)
\end{aligned}
\]

The inequality in (4) follows from the fact that $\mathrm{OPT}(S) \ge \mathrm{OPT}(S_\tau^q)$. (5) follows from Lemma 2.1. The first term in (6) follows from the inequality $\frac{x}{1+x} \le \log(1 + x)$ for $x > -1$, applied with $x = F(S)/F(S_\tau^q) - 1$, and the second term follows from the induction hypothesis that for each $G \subset S$, $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) \le \log(F(G)) + 1$. If $F(G) = 0$ for some set of objects $G$, we define $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) = 1$.

We can verify the base case of the induction as follows. If $F(G) = \beta$, which is the smallest non-zero impurity of $F$ on subsets of objects $S$, we claim that the optimal decision tree chooses the test with the smallest cost among those that can reduce the impurity function $F$:

\[
\mathrm{OPT}(G) = \min_{t \,:\, F(G_t^i) = 0,\ \forall i \in \text{outcomes}} c(t).
\]

Suppose otherwise: the optimal tree first chooses a test $t$ with a child node $G'$ such that $F(G') = \beta$ and later chooses another test $t'$ such that all the child nodes of $G'$ by $t'$ have zero impurity. Then $t'$ could have been chosen in the first place to reduce all child nodes of $G$ to zero impurity, by supermodularity of $F$, and therefore this cannot be the optimal ordering of tests. On the other hand, $R(t) = \infty$ in GreedyTree for those tests $t$ that cannot reduce impurity and $R(t) = c(t)/\beta$ for those tests that can. So the algorithm picks the test with the smallest cost among those that can reduce impurity. Thus, we have shown that $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) = 1 \le \log(F(G)) + 1$ for the base case. □


Given that P $\neq$ NP, the optimal order of approximation for the DFEP problem is $O(\log n)$, which is achieved by GreedyTree. This approximation does not depend on a particular impurity function, but instead holds for any function which satisfies the assumptions. In Section 3, we define a family of impurity functions that satisfy these assumptions.


3 Admissible Functions 

A fundamental element of constructing decision trees is the impurity function, which measures the disagreement of labels within a set of objects. Many impurity functions have been proposed for constructing decision trees, and the choice of impurity function can have a significant impact on the performance of the tree. In this section we examine the assumptions placed on the impurity function by Lemma 2.1 and Theorem 2.2, which we use to define a class of functions we call admissible impurity functions, and provide examples of admissible impurity functions.

Definition. A function $F$ of a set of objects is admissible if it satisfies the following five properties: (1) Non-negativity: $F(G) \ge 0$ for any set of objects $G$; (2) Purity: $F(G) = 0$ if $G$ consists of objects of the same class; (3) Monotonicity: $F(G) \ge F(R)$, $\forall R \subseteq G$; (4) Supermodularity: $F(G \cup j) - F(G) \ge F(R \cup j) - F(R)$ for any $R \subseteq G$ and object $j \notin G$; (5) $\log(F(S)) = O(\log n)$.
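Properties (1) to (4) can be checked by brute force on a small labelled set. The sketch below is illustrative only: the function name `is_admissible_on` is hypothetical, the check is exponential in the number of objects, and property (5) is asymptotic and therefore not tested. Supermodularity is checked for objects $j$ outside $G$.

```python
from collections import Counter
from itertools import combinations


def pairs(labels):
    """Pairs impurity: number of object pairs with different labels."""
    counts = Counter(labels).values()
    n = sum(counts)
    return (n * n - sum(c * c for c in counts)) // 2


def is_admissible_on(impurity, labels):
    """Brute-force check of properties (1)-(4) over all subsets of a small set."""
    idx = range(len(labels))
    f = {frozenset(sub): impurity([labels[i] for i in sub])
         for r in range(len(labels) + 1) for sub in combinations(idx, r)}
    for g, f_g in f.items():
        if f_g < 0:                                            # (1) non-negativity
            return False
        if len({labels[i] for i in g}) <= 1 and f_g != 0:      # (2) purity
            return False
        for r_size in range(len(g) + 1):
            for r in map(frozenset, combinations(sorted(g), r_size)):
                if f[r] > f_g:                                 # (3) monotonicity
                    return False
                for j in idx:
                    if j in g:
                        continue
                    # (4) supermodularity: marginal gain grows with the set
                    if f[g | {j}] - f_g < f[r | {j}] - f[r]:
                        return False
    return True


print(is_admissible_on(pairs, [1, 1, 2, 2, 3]))
print(is_admissible_on(lambda ls: min(pairs(ls), 1), [1, 2, 2]))
```

The second call uses a 0/1 indicator of impurity, which is non-negative, pure and monotone but not supermodular, so the check rejects it.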

A wide range of functions falls into the class of admissible impurity functions. We propose a general family of polynomial functions which we show is admissible. Given a set of objects $G$, $n_G^i$ denotes the number of objects in $G$ that belong to class $i$.

Lemma 3.1. Suppose there are $k$ classes in $G$. Any polynomial function of $n_G^1, \ldots, n_G^k$ with non-negative terms such that $n_G^1, \ldots, n_G^k$ do not appear as singleton terms is admissible. Formally, if

\[
F(G) = \sum_{i=1}^{M} \gamma_i \,(n_G^1)^{p_{i1}} (n_G^2)^{p_{i2}} \cdots (n_G^k)^{p_{ik}}, \tag{8}
\]

where the $\gamma_i$'s are non-negative, the $p_{ij}$'s are non-negative integers and for each $i$ there exist at least 2 non-zero $p_{ij}$'s, then $F$ is admissible.


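Proof. Properties (1), (2), (3) and (5) are obviously true. To show $F$ is supermodular, suppose $R \subseteq G$, object $j \notin G$, and $j$ belongs to class $j$. We have

\[
\begin{aligned}
F(R \cup j) - F(R)
&= \sum_{i \in I_j} \gamma_i \Big[ (n_R^1)^{p_{i1}} \cdots (n_R^j + 1)^{p_{ij}} \cdots (n_R^k)^{p_{ik}} - (n_R^1)^{p_{i1}} \cdots (n_R^j)^{p_{ij}} \cdots (n_R^k)^{p_{ik}} \Big] \\
&\le \sum_{i \in I_j} \gamma_i \Big[ (n_G^1)^{p_{i1}} \cdots (n_G^j + 1)^{p_{ij}} \cdots (n_G^k)^{p_{ik}} - (n_G^1)^{p_{i1}} \cdots (n_G^j)^{p_{ij}} \cdots (n_G^k)^{p_{ik}} \Big] \\
&= F(G \cup j) - F(G),
\end{aligned}
\]

where the summation index set $I_j$ is the set of terms that involve $n^j$. The inequality follows because $(n^j + 1)^{p_{ij}}$ can be expanded so that the negative term is canceled, leaving a sum-of-products form for $R$, which is term-by-term dominated by that of $G$. □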


A special case of polynomial impurity function is the previously proposed Pairs function $P(G)$ [Saettler et al., 2014; Cicalese et al., 2014; Moshkov, 2010]. Two objects $(s_1, s_2)$ are defined as a pair if they are of different classes, with the Pairs function $P(G)$ equal to the total number of pairs in the set $G$:

\[
P(G) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_G^i\, n_G^j,
\]

where $k$ is the number of distinct classes in set $G$.

Corollary 3.2. The Pairs impurity function is admissible.


As a corollary of Theorem 2.2 and Corollary 3.2, we see that the $O(\log n)$ approximation for Pairs and outcome-dependent cost holds for multiple test outcomes as well, extending the binary outcome setting shown in [Saettler et al., 2014].

Another family of admissible impurity functions is the Powers function.

Corollary 3.3. The Powers function

\[
F(G) = \Big(\sum_{i=1}^{k} n_G^i\Big)^{l} - \sum_{i=1}^{k} (n_G^i)^{l} \tag{9}
\]

is admissible for $l = 2, 3, \ldots$.

Note that Pairs can be viewed as a special case of the Powers function with $l = 2$ (up to a constant factor of 2). An important property of the Powers impurity functions is the fact that for any power $l$, the function is zero only if the objects in the set all belong to the same class. As a result, using any of these Powers impurity functions in GreedyTree results in an error-free tree with near optimal cost.

Another interesting admissible impurity function, used later in the paper, is the hinged-Pairs function, defined as

\[
P_\alpha(G) = \sum_{i \neq j} \Big[ [n_G^i - \alpha]_+ \, [n_G^j - \alpha]_+ - \alpha^2 \Big]_+ , \tag{10}
\]

where $[x]_+ = \max(x, 0)$. This function differs from the Powers impurity function due to the fact that for $\alpha > 0$, $P_\alpha(G) = 0$ need not imply that all objects in $G$ belong to the same class. In the next section, we will discuss how this allows for trees to be constructed incorporating classification error. We include the proof of the following lemma in the Appendix.

Lemma 3.4. In the multi-class setting, $P_\alpha(G)$ is admissible.

Impurity Function Selection: While all admissible impurity functions enjoy the $O(\log n)$ approximation of the optimal max-cost, they lead to different trees depending on the problem. To illustrate this point, consider the toy example in Figure 1. A set $G$ has 30 objects in Class 1 (circles) and 30 objects in Class 2 (triangles). Two tests $t_1$ and $t_2$ are available to the algorithm. Test $t_1$ separates 20 objects of Class 2 from the rest of the objects, while $t_2$ evenly divides the objects into halves with an equal number of objects from Class 1 and Class 2 in either half. Intuitively, $t_2$ is not a useful test from a classification point of view because it does not separate objects based on class at all. This is reflected in the right plot of Figure 1: choosing $t_2$ increases cost but does not reduce classification error, while choosing $t_1$ reduces the error to $1/6$. If the impurity function chosen is the Pairs function, test $t_2$ will be chosen due to the fact that Pairs biases towards tests with balanced test outcomes. In contrast, the hinged-Pairs function leads to test $t_1$, and therefore may be preferable in this case (for more details on this example see the Appendix). Although both impurity functions are admissible and return trees with near optimal guarantees, empirical performance can differ greatly and is strongly dependent on the structure of the data. In practice, we find that choosing the tree with the lowest classification error across a variety of impurity functions yields improved performance compared to a single impurity function strategy.

Figure 1: Illustration of different impurity functions for different greedy choice of tests. The left two figures show the test outcomes of tests $t_1$ and $t_2$. The right figure shows the classification error against cost (number of tests). Here using Pairs leads to choosing $t_2$ because it prefers balanced splits; using hinged-Pairs leads to choosing $t_1$, which is better from an error-cost trade-off point of view.


4 Trade-off Bounds

Up to this point, we have focused on constructing error-free trees. Unfortunately, the max-cost criterion is highly sensitive to outliers, and therefore often yields trees with unnecessarily large maximum depth to accommodate a small subset of outliers in the data set. Refer to the synthetic experiment in Section 5 for such an example. To overcome the sensitivity to outliers, we present an approach to constructing near optimal trees with non-zero error rates.

Early-stopping: Instead of requiring all leaves to have zero impurity ($F(L) = 0$) in Algorithm 1, we can stop the recursion as soon as all leaves have impurity below a threshold $\delta$ ($F(L) \le \delta$). This allows a trade-off between error and cost. Let $D_{F,\delta}$ denote the set of trees with $F(L) \le \delta$ for all leaves $L$ and let $\mathrm{OPT}_{F,\delta}(S)$ denote the optimal max-cost among all trees in $D_{F,\delta}$.

Similar to the error-free setting, the $O(\log n)$ approximation of the optimal max-cost still holds for early stopping, as shown next. The proofs of Lemma 4.1 and Theorem 4.2 are similar to those of Lemma 2.1 and Theorem 2.2 and we include them in the Appendix.

Lemma 4.1. Let $F$ be an admissible function and let $\tau$ be the first test chosen by GreedyTree on the set of objects $S$. Then

\[
\frac{c(\tau)\,(F(S) - \delta)}{F(S) - F(S_\tau^q)} \le \mathrm{OPT}_{F,\delta}(S).
\]

Theorem 4.2. GreedyTree constructs a decision tree achieving an $O(\log n)$-factor approximation of the optimal max-cost in $D_{F,\delta}$ on the set $S$ of $n$ objects if $F$ is admissible.

Hinged-Pairs: Similar to early-stopping, we can also use the hinged-Pairs function $P_\alpha$ (10) with $\alpha > 0$ in GreedyTree to allow an error-cost trade-off. We first establish an error upper bound for trees in $D_{P_\alpha,0}$.

Lemma 4.3. For a multi-class input set $S$ with $k$ classes, the classification error of any tree in $D_{P_\alpha,0}$ with $l$ leaves is bounded by $k(k-1)l\epsilon$, where we set $\alpha = \epsilon n$.

Proof. Suppose $j$ is the largest class in leaf $L$. For $i \neq j$, if $n_L^i > \alpha$, we have $\max\{n_L^i n_L^j - \alpha(n_L^i + n_L^j), 0\} = 0$, which implies $n_L^i n_L^j \le \alpha n_L$. So

\[
n_L^i \le \frac{\alpha\, n_L}{n_L^j} \le k\alpha = k\epsilon n,
\]

where the second inequality holds because the largest class satisfies $n_L^j \ge n_L / k$. If $n_L^i \le \alpha$, we have $n_L^i \le \epsilon n \le k\epsilon n$. So for any leaf $L$ we have $\frac{n_L - n_L^j}{n} \le k(k-1)\epsilon$. The overall error bound thus follows. □
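A small numerical illustration of the hinged-Pairs function and of the per-leaf bound used in Lemma 4.3 is sketched below. It assumes that the sum in the hinged-Pairs definition runs over unordered pairs of classes (an ordered sum only doubles the value and leaves the zero set unchanged); the leaf counts and parameter values are made up for the illustration.

```python
from itertools import combinations


def hinged_pairs(counts, alpha):
    """Hinged-Pairs impurity computed from per-class counts (unordered class pairs)."""
    def hinge(x):
        return max(x, 0)
    return sum(hinge(hinge(a - alpha) * hinge(b - alpha) - alpha ** 2)
               for a, b in combinations(counts, 2))


# With alpha = 0 the function reduces to the Pairs count.
print(hinged_pairs([30, 30], 0))
# With alpha = 5: (30 - 5) * (30 - 5) - 25 = 600.
print(hinged_pairs([30, 30], 5))

# Hypothetical leaf: n = 1000 objects overall, k = 3 classes, epsilon = 0.005.
n, k, eps = 1000, 3, 0.005
alpha = eps * n
leaf = [50, 3, 2]                      # majority class plus 5 minority objects
leaf_error = (sum(leaf) - max(leaf)) / n
print(hinged_pairs(leaf, alpha), leaf_error, k * (k - 1) * eps)
```

The leaf has zero hinged-Pairs impurity although it is not pure, and its contribution to the error ($0.005$) is within the per-leaf bound $k(k-1)\epsilon = 0.03$.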


Often in practice a tree may contain a relatively large number of leaves but only a small fraction of them contain most of the objects. A more refined upper bound on the error is given by the following lemma, which we prove in the Appendix.

Lemma 4.4. Consider a multi-class input set $S$ with $k$ classes and $\alpha = \epsilon n$. For any tree $T \in D_{P_\alpha,0}$ with $l$ leaves, given any $\eta \in [0,1]$, let $l_\eta$ be the smallest integer such that the $l_\eta$ largest leaves of $T$ have more than $1 - \eta$ of the total number of objects $n$. Then the classification error is bounded by $k(k-1)\,l_\eta\,\epsilon + \frac{k-1}{k}\eta$.

Denote by $D_{E,\epsilon}$ the class of trees with classification error less than or equal to $\epsilon$ on the input set $S$. We can further derive a useful relation between $D_{E,\epsilon}$ and $D_{P_{\epsilon n},0}$.

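Lemma 4.5. For any multi-class input set $S$ with $k$ classes, $D_{E,\epsilon} \subseteq D_{P_{\epsilon n},0} \subseteq D_{E,k(k-1)l\epsilon}$.


Proof. To show $D_{E,\epsilon} \subseteq D_{P_{\epsilon n},0}$, for any tree $T \in D_{E,\epsilon}$, we have $\sum_{i=1}^{l} \tilde{n}_{L_i} \le \epsilon n$, where $l$ is the number of leaves and $\tilde{n}_{L_i}$ is the number of objects in leaf $L_i$ that are not from the majority class: $\tilde{n}_L = n_L - \max_i n_L^i$. This implies $\tilde{n}_L \le \epsilon n$ for all leaves of $T$. Suppose $j$ is the class with the largest number of objects in leaf $L$: $n_L^j = \max_i n_L^i$. It is not hard to see that for any class $i \neq j$,

\[
\frac{n_L^i\, n_L^j}{n_L^i + n_L^j} \le n_L^i \le \epsilon n,
\]

which implies $\big[[n_L^i - \alpha]_+ [n_L^j - \alpha]_+ - \alpha^2\big]_+ = 0$. Thus we have $F(L) = \sum_{i \neq j} \big[[n_L^i - \alpha]_+ [n_L^j - \alpha]_+ - \alpha^2\big]_+ = 0$. Thus $D_{E,\epsilon} \subseteq D_{P_{\epsilon n},0}$. $D_{P_{\epsilon n},0} \subseteq D_{E,k(k-1)l\epsilon}$ follows from Lemma 4.3. □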
Figure 2: Synthetic example of outliers affecting max-cost. The left and right figures show the test outcomes of tests $t_1$ and $t_2$, respectively.

Figure 3: The error-cost trade-off plot of Algorithm 1 using Pairs on the synthetic example (axis label: Worst Testing Cost). 0.39% error can be achieved using only a depth-2 tree but it takes a depth-10 tree to achieve zero error.

The main theorem of this section is the following.

Theorem 4.6. In multi-class classification with $k$ classes, if $T$ is the decision tree returned by GreedyTree using hinged-Pairs (setting $\alpha = \epsilon n$) applied on the set $S$ of $n$ objects, then we have the following:

\[
\mathrm{Cost}_{P_\alpha}(S) \le O(\log n)\, \mathrm{OPT}_{P_\alpha,0}(S) \le O(\log n)\, \mathrm{OPT}_{E,\epsilon}(S).
\]

Proof. The first inequality follows from Theorem 2.2 and the second inequality follows from Lemma 4.5. □

The above theorem states that for a given error parameter $\epsilon$, a greedy tree can be constructed using hinged-Pairs $P_\alpha$ by setting $\alpha = \epsilon n$, with the max-cost guaranteed to be within an $O(\log n)$ factor of the best possible max-cost among all decision trees that have classification error less than or equal to $\epsilon$. To our knowledge this is the first bound relating classification error to cost, which provides a theoretical basis for the accuracy-cost trade-off.

5 Experimental Results 

We first demonstrate the effect of outliers using a simple synthetic example, where a small set of outliers dramatically increases the max-cost of the tree. We show that allowing a small number of errors in the tree drastically reduces the cost of the tree, allowing for efficient trees to be constructed in the presence of outliers. Next, we demonstrate the ability to construct decision trees on real-world data sets. We observe a similar behavior to the synthetic data set on many of these data sets, where allowing a small amount of error results in trees with significantly lower cost. Additionally, we see the effect of impurity function choice on the performance of the trees. For all real data sets, we present the performance of the Powers impurity function presented in Eq. (9) with $l = 2, 3, 4, 5$ and error introduced by early stopping, as well as the hinged-Pairs impurity function presented in Eq. (10) with error introduced by varying the parameter $\alpha$.


Synthetic Example: Here we consider a multi-class classification example to demonstrate the effect a small set of objects can have on the max-cost of the tree. Consider a data set composed of 1024 objects belonging to 4 classes with 10 binary tests available. Assume that the set of tests is complete, that is, no two objects have the same set of test outcomes. Note that by fixing the order of the tests, the set of test outcomes maps each object to an integer in the range [0, 1023]. From this mapping, we give the objects in the ranges [1, 255], [257, 511], [513, 767], and [769, 1023] the labels 1, 2, 3, and 4, respectively, and the objects 0, 256, 512, and 768 the labels 2, 3, 4, and 1, respectively (Figure 2 shows the data projected to the first two tests). Suppose each test carries a unit cost. By Kraft's Inequality [Cover and Thomas, 1991], the optimal max-cost in order to correctly classify every object is 10. However, using only $t_1$ and $t_2$, as selected by the greedy algorithm, leads to a correct classification of all but 4 objects, as shown in Figure 3. For this type of data set, a constant-sized set of outliers can change the required tree from one with a constant max-cost to one with a $\log n$ max-cost.
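The synthetic data set is small enough to reproduce directly. The sketch below assumes that the two tests used by the depth-2 tree correspond to the two most significant bits of the object index (this identification of $t_1$ and $t_2$ is an assumption of the sketch), and that each leaf predicts its majority label.

```python
from collections import Counter


def label(x):
    """Label of object x in [0, 1023] as described in the synthetic example."""
    quadrant, offset = divmod(x, 256)
    if offset == 0:
        # Objects 0, 256, 512, 768 carry the labels 2, 3, 4, 1.
        return (quadrant + 1) % 4 + 1
    return quadrant + 1


# Depth-2 tree: the two most significant of the 10 bits select one of 4 leaves.
leaves = {}
for x in range(1024):
    leaves.setdefault(x >> 8, []).append(label(x))

# Each leaf predicts its majority label; count the misclassified objects.
errors = sum(len(ls) - Counter(ls).most_common(1)[0][1] for ls in leaves.values())
print(errors, f"{100 * errors / 1024:.2f}%")
```

Only the four relabeled objects are misclassified by the depth-2 tree, giving the error of roughly 0.39% quoted for Figure 3, whereas separating object 0 from objects 1 to 255 requires all of the remaining bits.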


Data Sets: We compare performance using 9 data sets from the UCI Repository [Frank and Asuncion, 2010]. We assume that all tests (features) have a uniform cost. For each data set, we replace non-unique objects with a single instance using the most common label for the objects, allowing every data set to be complete (perfectly classified by the decision trees). Additionally, continuous features are transformed to discrete features by quantizing to 10 uniformly spaced levels. More details on the data sets used can be found in the Appendix.

Figure 4: Comparison of classification error vs. max-cost for the Powers impurity function in (9) for $l = 2, 3, 4, 5$ and the hinged-Pairs impurity function in (10). Note that for both House Votes and WBCD, the depth-0 tree is not included as the error decreases dramatically using a single test. In many cases, the hinged-Pairs impurity function outperforms the Powers impurity functions for trees with smaller max-costs, whereas the Powers impurity function outperforms the hinged-Pairs function for larger max-costs.
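The two preprocessing steps described above can be sketched as follows. This is a hedged illustration, not the authors' code: it interprets "10 uniformly spaced levels" as equal-width bins between the minimum and maximum of a feature, and the function names `quantize` and `deduplicate` are hypothetical.

```python
from collections import Counter, defaultdict


def quantize(values, levels=10):
    """Map continuous values to integer levels 0..levels-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / levels
    return [min(int((v - lo) / width), levels - 1) for v in values]


def deduplicate(rows, labels):
    """Keep one instance per distinct feature vector, labelled by its most common label."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(row)][y] += 1
    return {row: counts.most_common(1)[0][0] for row, counts in votes.items()}


print(quantize([0, 3, 10]))
print(deduplicate([(1, 0), (1, 0), (1, 0), (0, 1)], ["a", "a", "b", "c"]))
```

After deduplication no two objects share the same feature vector, which is what makes the set of tests complete in the sense required by the DFEP definition.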

Error vs. Cost Trade-Off: Fig. 4 shows the trade-off between classification error and max-cost, which suggests two key trends. First, it appears that many data sets, such as house votes, Statlog DNA, Wisconsin breast cancer, and mammography, can be classified with minimal error using few tests. Intuitively, this small error appears to correspond to a small subset of outlier objects which require a large number of tests to correctly classify, while the majority of the data can be classified with a small number of tests. Second, empirical evidence suggests that the optimal choice of impurity function is dependent on the desired max-cost of the tree. For trees with a smaller budget (and therefore lower depth), the hinged-Pairs impurity function outperforms the Powers impurity function with early stopping, whereas for a larger budget (and greater depth), the Powers impurity function outperforms hinged-Pairs. This matches our intuitive understanding of the impurity functions, as the Powers impurity function biases towards tests which evenly divide the data whereas hinged-Pairs puts more emphasis on classification performance.






















































































6 Conclusion 

We characterize a broad class of admissible impurity functions that can be used in a greedy algorithm to yield $O(\log n)$ guarantees of the optimal max-cost. We give examples of such admissible functions and demonstrate that they have different empirical properties even though they all enjoy the $O(\log n)$ guarantee. We further design admissible functions to allow for an accuracy-cost trade-off and provide a bound relating classification error to cost. Finally, through real-world data sets we demonstrate that our algorithm can indeed censor the outliers and achieve high classification accuracy using low max-cost. To visualize such outliers we construct a 2-D synthetic experiment and show that our algorithm successfully identifies these as outliers.
Appendix

Proof of Lemma 3.4. Before showing admissibility of the hinged-Pairs function in the multiclass setting, we first show $P_\alpha(G)$ is admissible for the binary setting.

Lemma 6.1. Consider the binary classification setting, and let

\[
P_\alpha(G) = \Big[ [n_G^1 - \alpha]_+ \, [n_G^2 - \alpha]_+ - \alpha^2 \Big]_+ ,
\]

where $[x]_+ = \max(x, 0)$. Then $P_\alpha(G)$ is admissible.

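Proof. All the properties are obviously true except supermodularity. To show supermodularity, suppose
$R \subseteq G$ and object $j \notin R$. Suppose $j$ belongs to the first class. We need to show

$$P_\alpha(G \cup j) - P_\alpha(G) \geq P_\alpha(R \cup j) - P_\alpha(R). \tag{11}$$

Consider 4 cases:

(1) $P_\alpha(R) = P_\alpha(R \cup j) = 0$: The right hand side of (11) is 0 and (11) holds because of monotonicity of $P_\alpha$.

(2) $P_\alpha(R) = 0$, $P_\alpha(R \cup j) > 0$, $P_\alpha(G) = 0$: (11) reduces to $P_\alpha(G \cup j) \geq P_\alpha(R \cup j)$, which is true by
monotonicity.

(3) $P_\alpha(R) = 0$, $P_\alpha(R \cup j) > 0$, $P_\alpha(G) > 0$: Note that $P_\alpha(G) > 0$ implies that $[n_G^1 - \alpha]_+ [n_G^2 - \alpha]_+ - \alpha^2 > 0$,
which further implies $n_G^1 > \alpha$, $n_G^2 > \alpha$. Thus the left hand side is

$$P_\alpha(G \cup j) - P_\alpha(G) = (n_G^1 - \alpha + 1)(n_G^2 - \alpha) - \alpha^2 - \big((n_G^1 - \alpha)(n_G^2 - \alpha) - \alpha^2\big) = n_G^2 - \alpha.$$

The right hand side is

$$P_\alpha(R \cup j) = (n_R^1 - \alpha + 1)(n_R^2 - \alpha) - \alpha^2 = (n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2 + (n_R^2 - \alpha).$$

If $n_R^1 > \alpha$, $P_\alpha(R) = \max\big((n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2, 0\big) = 0$ because $P_\alpha(R \cup j) > 0$ implies $n_R^2 > \alpha$. So
$P_\alpha(R \cup j) \leq n_R^2 - \alpha \leq n_G^2 - \alpha = P_\alpha(G \cup j) - P_\alpha(G)$.

(4) $P_\alpha(R) > 0$: We have

$$P_\alpha(G \cup j) - P_\alpha(G) = n_G^2 - \alpha \geq n_R^2 - \alpha = P_\alpha(R \cup j) - P_\alpha(R).$$

This completes the proof. □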
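As a sanity check on Lemma 6.1, inequality (11) can be tested exhaustively on small class counts. The following is a minimal sketch, not part of the original proof: the function names are hypothetical, each set is represented only by its pair of class counts, and the added object is assumed to be new to both sets and to belong to the first class.

```python
from itertools import product


def hinge(x):
    """Hinge [x]_+ = max(x, 0)."""
    return max(x, 0)


def hinged_pairs_binary(n1, n2, alpha):
    """Binary hinged-Pairs impurity of a set with n1 and n2 objects per class."""
    return hinge(hinge(n1 - alpha) * hinge(n2 - alpha) - alpha ** 2)


def supermodularity_violations(max_count, alpha):
    """Return every (R, G) pair of class counts, with R contained in G, for which
    adding one first-class object increases the impurity of R more than that of G."""
    violations = []
    for g1, g2 in product(range(max_count + 1), repeat=2):
        gain_g = hinged_pairs_binary(g1 + 1, g2, alpha) - hinged_pairs_binary(g1, g2, alpha)
        # R ranges over all count pairs dominated by G.
        for r1, r2 in product(range(g1 + 1), range(g2 + 1)):
            gain_r = hinged_pairs_binary(r1 + 1, r2, alpha) - hinged_pairs_binary(r1, r2, alpha)
            if gain_g < gain_r:
                violations.append(((r1, r2), (g1, g2)))
    return violations


violations = supermodularity_violations(max_count=12, alpha=3)
```

An empty `violations` list is what the lemma predicts. A finite check of this kind only illustrates the inequality on the tested range and does not replace the case analysis.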

Now we are ready to generalize from the binary hinged-Pairs function to the multiclass hinged-Pairs function.
Again, all properties are obviously true except supermodularity. Supermodularity follows from the fact that
each term in the sum is supermodular according to Lemma 6.1.

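Proof of Lemma 4.4 We begin by considering any leaf $L$ of $T$; suppose $j$ is the largest class in $L$. For
$i \neq j$, if $n_L^i > \alpha$, we have

$$\big[[n_L^i - \alpha]_+ [n_L^j - \alpha]_+ - \alpha^2\big]_+ = \max\big(n_L^i n_L^j - \alpha(n_L^i + n_L^j),\, 0\big) = 0,$$

which implies $n_L^i n_L^j \leq \alpha n_L$. So

$$n_L^i \leq \frac{k\, n_L^i n_L^j}{n_L} \leq k\alpha = k\epsilon n,$$

where the first inequality uses $n_L^j \geq n_L / k$, since $j$ is the largest of the $k$ classes in $L$.

If $n_L^i \leq \alpha$, we have $n_L^i \leq \epsilon n \leq k\epsilon n$. Let $\tilde{n}_L$ be the number of objects in leaf $L$ that are not from the majority
class: $\tilde{n}_L = n_L - n_L^j$. So for any leaf $L$ we have $\frac{\tilde{n}_L}{n} \leq k(k-1)\epsilon$.

Now we enumerate the leaves of $T$ in non-increasing order according to the number of objects they contain.
Let $A$ be the set of the first $l_\eta$ leaves. By definition of $l_\eta$, the total number of objects contained in $A$ is
$n_A \geq (1 - \eta) n$.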

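The overall error bound is obtained by considering leaves in $A$ and the complement $\bar{A}$ separately:

$$\frac{\sum_{L \in A} \tilde{n}_L + \sum_{L \in \bar{A}} \tilde{n}_L}{n} \leq \frac{k(k-1)\epsilon\, l_\eta n + \frac{k-1}{k}\eta n}{n} = k(k-1)\, l_\eta \epsilon + \frac{k-1}{k}\eta,$$

where we have used the fact that $\tilde{n}_L \leq \frac{k-1}{k} n_L$ and that $\sum_{L \in \bar{A}} n_L \leq \eta n$.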
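To make the quantities in this bound concrete, the following minimal sketch evaluates both sides on a toy set of leaves. Everything in it is an assumption for illustration: the function names, the leaf counts, the values of $\epsilon$ and $\eta$, and the reading of $l_\eta$ as the smallest number of largest leaves that together hold at least $(1-\eta)n$ objects. The toy leaves are chosen so that each has zero hinged-Pairs impurity at $\alpha = \epsilon n$.

```python
from itertools import combinations


def hinged_pairs(counts, alpha):
    """Sum of hinged-Pairs terms over unordered class pairs (enough to test for zero)."""
    return sum(
        max(max(a - alpha, 0) * max(b - alpha, 0) - alpha ** 2, 0)
        for a, b in combinations(counts, 2)
    )


def majority_error(leaves):
    """Fraction of objects that are not in the majority class of their leaf."""
    n = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) - max(leaf) for leaf in leaves) / n


def smallest_l_eta(leaves, eta):
    """Smallest number of largest leaves holding at least (1 - eta) * n objects (assumed definition)."""
    sizes = sorted((sum(leaf) for leaf in leaves), reverse=True)
    n = sum(sizes)
    covered = 0
    for count, size in enumerate(sizes, start=1):
        covered += size
        if covered >= (1 - eta) * n:
            return count
    return len(sizes)


def error_bound(k, l_eta, eps, eta):
    """Right hand side of the bound: k(k-1) * l_eta * eps + (k-1)/k * eta."""
    return k * (k - 1) * l_eta * eps + (k - 1) / k * eta


# Hypothetical tree with k = 2 classes and n = 100 objects; each leaf lists per-class counts.
leaves = [[40, 3], [2, 45], [6, 4]]
k, n, eps, eta = 2, 100, 0.05, 0.25
alpha = eps * n

all_pure = all(hinged_pairs(leaf, alpha) == 0 for leaf in leaves)
l_eta = smallest_l_eta(leaves, eta)
error = majority_error(leaves)
bound = error_bound(k, l_eta, eps, eta)
```

On this toy input the empirical error stays below the bound, as the lemma requires. The gap between the two suggests the bound can be loose on easy instances.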

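Details of Computation in Figure 1 If Pairs is used, we can compute the impurity of each set of interest:
$P(G) = 30 \times 30 = 900$, $P(G_{t_1}^1) = 30 \times 10 = 300$, $P(G_{t_1}^2) = 0$, $P(G_{t_2}^1) = P(G_{t_2}^2) = 15 \times 15 = 225$; according
to the greedy algorithm we can compute $R(t_1) = \max\{\frac{1}{900-300}, \frac{1}{900-0}\} = \frac{1}{600}$, $R(t_2) = \max\{\frac{1}{900-225}, \frac{1}{900-225}\} = \frac{1}{675}$,
so $t_2$ will be chosen. On the other hand, the impurities for the hinged-Pairs with $\alpha = 8$ are $P_\alpha(G) =
22 \times 22 = 484$, $P_\alpha(G_{t_1}^1) = 22 \times 2 = 44$, $P_\alpha(G_{t_1}^2) = 0$, $P_\alpha(G_{t_2}^1) = P_\alpha(G_{t_2}^2) = 7 \times 7 = 49$; again we can compute
$R(t_1) = \max\{\frac{1}{484-44}, \frac{1}{484-0}\} = \frac{1}{440}$, $R(t_2) = \max\{\frac{1}{484-49}, \frac{1}{484-49}\} = \frac{1}{435}$, so $t_1$ will be chosen. The above
example shows that Pairs has a stronger preference for balanced tests and may in some cases lead to poor
classification results.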
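The arithmetic in this example can be reproduced with a short script. This is a minimal sketch under stated assumptions: unit test costs, per-class counts inferred from the products quoted above (for instance, a child with 30 and 10 objects for the first outcome of the first test), and a hinged impurity computed as the product of hinged counts, which is what the quoted figures such as 22 x 22 = 484 correspond to. All function and variable names are hypothetical.

```python
from fractions import Fraction
from functools import partial


def pairs(n1, n2):
    """Pairs impurity for two classes: product of the class counts."""
    return n1 * n2


def hinged_product(n1, n2, alpha):
    """Product of hinged counts, matching the figures quoted in the example."""
    return max(n1 - alpha, 0) * max(n2 - alpha, 0)


def worst_case_ratio(impurity, parent, children, cost=1):
    """R(t): largest value of cost / (F(G) - F(G_t^i)) over the outcomes i of test t."""
    f_parent = impurity(*parent)
    return max(Fraction(cost, f_parent - impurity(*child)) for child in children)


def choose_test(impurity, parent, tests):
    """Greedy choice: the test with the smallest worst-case ratio."""
    return min(tests, key=lambda name: worst_case_ratio(impurity, parent, tests[name]))


# Assumed class counts (class 1, class 2) for the root and for each test outcome.
G = (30, 30)
tests = {
    "t1": [(30, 10), (0, 20)],   # unbalanced split; second outcome is pure
    "t2": [(15, 15), (15, 15)],  # balanced split; both outcomes stay mixed
}

hinged = partial(hinged_product, alpha=8)
chosen_by_pairs = choose_test(pairs, G, tests)
chosen_by_hinged = choose_test(hinged, G, tests)
```

Exact fractions avoid any rounding ambiguity when two ratios are close, as with 1/440 and 1/435 here.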

Details of Data Sets The house votes data set is composed of the voting records for 435 members of 
the U.S. House of Representatives (342 unique voting records) on 16 measures, with a goal of identifying 
the party of each member. The sonar data set contains 208 sonar signatures, each composed of energy 
levels (quantized to 10 levels) in 60 different frequency bands, with a goal of identifying whether the object is a mine (metal cylinder) or a rock. The ionosphere
data set has 351 (350 unique) radar returns, each composed of 34 responses (quantized to 10 levels), with 
a goal of identifying if an event represents a free electron in the ionosphere. The Statlog DNA data set is 
composed of 3186 (3001 unique) DNA sequences with 180 features, with a goal of predicting whether the 
sequence represents a boundary of DNA to be spliced in or out. The Boston housing data set contains 13 
attributes (quantized to 10 levels) pertaining to 506 (469 unique) different neighborhoods around Boston, 
with a goal of predicting into which quartile the median income of the neighborhood falls. The
soybean data set contains 307 examples (303 unique), each composed of 34 categorical features, with a goal
of predicting which of 19 diseases is afflicting the soybean plant. The pima data set is composed of
8 features (with continuous features quantized to 10 levels) corresponding to medical information and tests 
for 768 patients (753 unique feature patterns), with a goal of diagnosing diabetes. The Wisconsin breast 
cancer data set contains 30 features corresponding to properties of a cell nucleus for 569 samples, with a 
goal of identifying if the cell is malignant or benign. The mammography data set contains 6 features from 
mammography scans (with age quantized into 10 bins) for 830 patients, with a goal of classifying the lesions 
as malignant or benign. 